What is Regression?
Regression is a foundational tool in statistics and machine learning, used to quantify relationships between variables. At its core, regression asks a simple question:
How does one quantity change when another does?
Imagine you’re a biologist estimating how an animal’s metabolic rate scales with its body mass. Regression provides a principled framework for translating observed data into a mathematical description, one that not only fits the past but can also forecast the future.
Regression models predict outputs that are continuous quantities, i.e. they can take infinitely many values. In the example above, the metabolic rate can take any value in a continuous range.
Regression is classified under supervised machine learning, and two kinds of variables are present in the model:
Target: The variable we are trying to predict, e.g. the animal’s metabolic rate.
Features: Input variables that the target depends on, e.g. body mass.
In practice, a model will be trained on labelled data to understand the relationship between data features and the dependent variable. By estimating this relationship, the model can predict the outcome of new and unseen data.
Common uses for machine learning regression models include:
Retail Demand Forecasting: Predict future product sales from past sales and pricing
House Price Valuation: Estimate property values using features like size, location etc.
Dose–Response Modeling: Relate drug dose (or exposure level) to biological effect.
Energy Load Prediction: Predict electricity consumption from weather and time‑of‑day.
Environmental Trend Analysis: Quantify climate or population changes over time.
Marketing Spend Optimization: Allocate ad budgets by linking spend to revenue outcomes.
Machine learning offers a variety of methods for tackling regression problems. These popular algorithms differ in how many predictors they incorporate, the kinds of data they can handle, and the assumed form of the relationship between inputs and outputs.
For instance, linear regression methods presuppose a straight‐line connection between the independent and dependent variables, while polynomial regression can assume more complex relationships.
Although there are many different types of regression algorithms, the introductory and basic ones are as follows:
Simple Linear Regression
Multiple Linear Regression
Polynomial Regression
Linear regression is often called the mother of all machine learning algorithms, as it laid the basis for further advancements.
Linear regression is a statistical technique for modeling the relationship between one continuous outcome (the target) and one or more continuous or categorical inputs (the features). It assumes that the expected value of the target can be expressed as a linear combination of the features. In practice, it fits a “best‐fit” line (in two dimensions) or hyperplane (in higher dimensions) through the data by estimating an intercept term and one coefficient per feature. Once fitted, the model lets you quantify how each input influences the output and make predictions for new input values.
Simple linear regression models the relationship between a single feature \(x\) and a target \(y\) by fitting the best straight line through the data. Formally, it assumes:
\(y = wx + b + \varepsilon\)
\(\varepsilon\) captures random deviation around the line. By choosing \(w\) and \(b\) appropriately, simple linear regression provides both an interpretable summary of how one feature influences the target and a straightforward rule for predicting \(y\) from new values of \(x\).
Our model is \(f_{w,b}(x) = wx + b\). This predicts the target values.
We define a function called the cost function to quantify the prediction errors.
The Cost function is defined as \(J(w,b) = \ \frac{1}{2m}\sum_{i = 1}^{m}{(f_{w,b}(x^{(i)}) - y^{(i)})}^{2}\)
Where m is the number of data points.
As we can see, this is essentially the mean squared error (MSE) of the predictions. The extra factor of two alongside the m in the denominator is just for mathematical convenience later on (it cancels when differentiating).
In the case of simple linear regression, the cost function is a function of two variables: \(w\) and \(b\).
This choice of cost function can be derived formally from maximum likelihood estimation.
From an intuitive perspective, it is a function that penalises large errors, is convex, and is differentiable everywhere.
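As a concrete sketch of the cost function above, here is a minimal NumPy implementation with the 1/(2m) convention. The toy dataset is hypothetical, chosen so that the true parameters give zero cost:

```python
import numpy as np

def cost(w, b, x, y):
    """Cost J(w, b): half the mean squared error of f(x) = w*x + b."""
    m = len(x)
    predictions = w * x + b        # f_{w,b}(x^(i)) for every data point
    errors = predictions - y
    return np.sum(errors ** 2) / (2 * m)

# Hypothetical toy data generated by y = 2x + 1 exactly,
# so the true parameters (w=2, b=1) give zero cost
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(cost(2.0, 1.0, x, y))   # 0.0
print(cost(0.0, 0.0, x, y))   # (9 + 25 + 49) / 6 ≈ 13.833
```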
To minimize the cost function in simple linear regression, gradient descent is the most common and effective method. It iteratively adjusts the model's parameters to reduce the cost.
It works by calculating the gradient (direction of the steepest increase) of the cost function and then taking a step in the opposite direction (the direction of the steepest decrease).
The mathematical formulation goes as follows
At each step of gradient descent we update \(w\) and \(b\):
Repeat until convergence {
\(w\ = \ w\ - \alpha\frac{\partial J(w,b)}{\partial w}\)
\(b\ = \ b\ - \ \alpha\frac{\partial J(w,b)}{\partial b}\)
}
Where \(\alpha\) is the learning rate of the algorithm and should not be too high or too low.
It is important to carry out the two updates simultaneously, using the old values of \(w\) and \(b\) in both derivatives, rather than one after the other within a single step.
There are many different variations of gradient descent that converge faster which will be covered in upcoming articles.
After gradient descent has converged, the model has reached its optimal parameter values.
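The full loop described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data with known parameters (the data, learning rate, and iteration count are my own choices, not from the text); note the simultaneous update of both parameters:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iterations=10_000):
    """Fit w, b for f(x) = w*x + b by batch gradient descent on J(w, b)."""
    m = len(x)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        errors = (w * x + b) - y
        dw = np.sum(errors * x) / m            # ∂J/∂w
        db = np.sum(errors) / m                # ∂J/∂b
        w, b = w - alpha * dw, b - alpha * db  # simultaneous update
    return w, b

# Synthetic data generated by y = 2x + 1, so we know the answer in advance
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
w, b = gradient_descent(x, y, alpha=0.05)
print(round(w, 3), round(b, 3))   # ≈ 2.0 1.0
```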
A visualization of a simple linear regression model fitting to the dataset
Multiple linear regression extends simple linear regression to model a target \(y\) using multiple input features \(x = (x_{1}, x_{2}, \ldots, x_{n})\). It assumes that the expected target is a linear combination of those features plus a constant bias term. In the familiar ML notation:
\(f_{w,b}(x) = w_{1}x_{1} + w_{2}x_{2} + \ldots + w_{n}x_{n} + b\)
where \(w = (w_{1}, w_{2}, \ldots, w_{n})\).
It is now convenient to collect the weights and features into vectors, which allows us to write the equation compactly as
\(f_{w,b}(x) = w \cdot x + b\)
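The dot-product form maps directly onto NumPy. A tiny sketch with hypothetical weights and one example's feature vector (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical weights and bias for a model with 3 features
w = np.array([0.5, -2.0, 1.5])
b = 4.0
x = np.array([10.0, 3.0, 2.0])   # one example's feature vector

prediction = np.dot(w, x) + b    # f_{w,b}(x) = w · x + b
print(prediction)                # 0.5*10 - 2*3 + 1.5*2 + 4 = 6.0
```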
Model evaluation and training
We use the same cost function as we did for simple linear regression, just with the updated definition of \(f_{w,b}(x)\).
The gradient descent algorithm also runs the same way, except that we now have more than one weight and need to update all of them in every epoch.
\(\frac{\partial J(w,b)}{\partial w}\) is now a vector of partial derivatives of the cost function with respect to each of the \(w_{i}\), rather than a single derivative value.
\(\frac{\partial J(w,b)}{\partial b}\) remains a scalar: the partial derivative with respect to the bias.
Using the same steps as given above, we can simultaneously update the vector \(w\) and the bias \(b\).
The derivative \(\frac{\partial J(w,b)}{\partial w_{j}}\) is given by \(\frac{1}{m}\sum_{i = 1}^{m}{(f_{w,b}(x^{(i)}) - y^{(i)})x_{j}^{(i)}}\), \(j \in \lbrack 1,n\rbrack\)
and \(\frac{\partial J(w,b)}{\partial b}\) is given by \(\frac{1}{m}\sum_{i = 1}^{m}(f_{w,b}(x^{(i)}) - y^{(i)})\)
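These two gradient expressions can be computed in a fully vectorized way. A minimal NumPy sketch (the helper name `gradients` and the toy data are my own, not from the text):

```python
import numpy as np

def gradients(w, b, X, y):
    """Vectorized gradients of J for multiple linear regression.
    X has shape (m, n): m examples, n features."""
    m = X.shape[0]
    errors = X @ w + b - y   # f_{w,b}(x^(i)) - y^(i) for all i at once
    dw = X.T @ errors / m    # vector of ∂J/∂w_j for j = 1..n
    db = np.sum(errors) / m  # scalar ∂J/∂b
    return dw, db

# Tiny illustrative check: 2 examples, 2 features, zero initial parameters
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 1.0])
dw, db = gradients(np.zeros(2), 0.0, X, y)
print(dw, db)   # [-2. -3.] -1.0
```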
Visualization of multiple linear regression with 2 features fitting the dataset
It turns out that, using linear algebra, we can solve for the optimal parameters in closed form (the normal equation) instead of using an iterative solution.
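A sketch of this closed-form approach on synthetic data (the data is fabricated for illustration; `np.linalg.lstsq` is used rather than an explicit matrix inverse because it is the numerically stabler route to the same least-squares solution):

```python
import numpy as np

# Synthetic data generated by y = 3*x1 - 2*x2 + 5, so the answer is known
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 1.0]])
y = X @ np.array([3.0, -2.0]) + 5.0

# Append a column of ones so the bias b is absorbed into the weight vector
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation solution theta = (XᵀX)⁻¹ Xᵀ y, computed via least squares
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(theta)   # ≈ [ 3. -2.  5.]  ->  w1, w2, b
```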
Advantages of Linear Regression
Simplicity: Easy to understand and implement.
Interpretability: Coefficients directly show each feature’s impact on the outcome.
Speed: Training and prediction are computationally lightweight.
Low Overfitting Risk: Tends to generalize well with few features.
Foundation for Extensions: Forms the basis for GLMs, polynomial regression, and kernels.
Convex Optimization: Guarantees finding the global minimum of the loss function.
Disadvantages of Linear Regression
Linearity Assumption: Assumes a straight-line (or hyperplane) relationship; fails if the true pattern is nonlinear.
Sensitivity to Outliers: Squared-error loss heavily penalizes large residuals, so a few outliers can skew the fit.
Collinearity Issues: Highly correlated features inflate coefficient variances, leading to unstable estimates.
Limited Expressiveness: Cannot capture complex patterns or interactions unless manually engineered into features.
For the fit to work optimally the model assumes the following:
Linearity: The expected target is a linear function of the features.
Independent Errors: The residuals are uncorrelated with each other.
Homoscedasticity: All residuals have the same variance across predictions.
Zero-Mean Errors: The average of the residuals is zero.
Normality of Errors (for inference): Residuals follow a normal distribution.
No Omitted Variable Bias: All relevant confounders are included so errors aren’t correlated with features.
Correct Functional Form: The model uses appropriate transformations so that the linear specification holds.
Sometimes the data has nonlinear relationships with the target that go undetected by a simple linear model. Polynomial regression seeks to solve this problem.
Polynomial regression is a technique for modeling a nonlinear relationship between features and a target y by fitting a polynomial function of the features. Instead of a straight line, it assumes:
\(f_{w,b}(x) = w_{1}x + w_{2}x^{2} + \ldots + w_{n}x^{n} + b\)
where \(n\) is the chosen polynomial degree. By treating each power of \(x\) as its own feature, polynomial regression remains a linear model in the weights \(w_{j}\) but can capture curved patterns in the data. Of course, we can use multiple features \(x_{1}, x_{2}, x_{3}\), etc., together with their polynomial features, but here we show only one feature for simplicity. One can consider it to be multiple linear regression in which the additional features are just polynomial versions of the existing ones. Hence cost function calculation, gradient descent, and implementation are just like multiple linear regression, with the extra features being polynomial powers of the current features.
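The "powers as extra features" idea above can be sketched directly in NumPy: build the polynomial design matrix by hand and fit it exactly as a multiple linear regression (the curved toy data is fabricated so the recovered coefficients can be checked; scikit-learn's PolynomialFeatures would build the same matrix):

```python
import numpy as np

# Synthetic curved data generated by y = 1 + 2x - 0.5x^2
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x ** 2

degree = 2
# Columns [x, x^2, ..., x^degree, 1]: each power of x is its own feature
X_poly = np.column_stack([x ** j for j in range(1, degree + 1)]
                         + [np.ones_like(x)])

# The model is still linear in the weights, so least squares applies as-is
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(theta)   # ≈ [ 2.  -0.5  1. ]  ->  w1, w2, b
```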
How varying the degree of the polynomial can go from underfitting to overfitting
Advantages of Polynomial Regression
Captures Nonlinearity: Models curved relationships by adding higher-order terms.
Linear in Parameters: Still solved via OLS, with closed-form or convex optimization.
Flexible Fit: The degree d controls complexity, allowing a close fit to varied data patterns.
Interpretable Extensions: Each coefficient \(w_{j}\) has a clear meaning for the \(x^{j}\) term.
Easy Implementation: Use PolynomialFeatures in scikit-learn to generate powers and feed them into any linear model.
Disadvantages of Polynomial Regression
Overfitting Risk: A high degree can fit noise, leading to poor generalization.
Degree Selection: Choosing d is nontrivial; too low underfits, too high overfits.
Multicollinearity: Higher-order features tend to be highly correlated, inflating the variance of estimates.
Interpretability Loss: As d grows, the model’s shape becomes harder to explain intuitively.
Implementing and Improving Polynomial Regression
We can improve this algorithm by tuning the degree on a cross-validation set and using the degree that performs best.
Also, to prevent overfitting, regularization should be incorporated into the model.
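Both ideas can be combined in one short sketch: hold out a validation set, fit a ridge-regularized polynomial for each candidate degree, and keep the degree with the lowest validation error. Everything here is illustrative (the noisy sine data, the penalty strength, and the simplification of penalizing the bias term along with the weights are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 60)
y = np.sin(x) + rng.normal(scale=0.1, size=x.shape)  # noisy nonlinear target

# Hold out every third point as a validation set
val = np.arange(len(x)) % 3 == 0
x_tr, y_tr, x_va, y_va = x[~val], y[~val], x[val], y[val]

def design(x, d):
    # Powers of x plus a bias column of ones
    return np.column_stack([x ** j for j in range(1, d + 1)]
                           + [np.ones_like(x)])

lam = 1e-3   # small ridge penalty keeps high-degree fits numerically stable
best = None
for d in range(1, 8):
    X = design(x_tr, d)
    # Ridge closed form: theta = (XᵀX + λI)⁻¹ Xᵀ y
    # (bias is penalized too here, purely for brevity)
    theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y_tr)
    err = np.mean((design(x_va, d) @ theta - y_va) ** 2)
    if best is None or err < best[1]:
        best = (d, err)
print("best degree:", best[0])
```

On data like this, an odd degree around 3 to 5 typically wins, since sin is well approximated by low odd-degree polynomials on a short interval.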
Check out these videos by StatQuest and the whole channel in general :
The Main Ideas of Fitting a Line to Data (The Main Ideas of Least Squares and Linear Regression.)
Linear Regression, Clearly Explained!!!
Multiple Regression, Clearly Explained!!!
Or this playlist by CampusX :
Linear Regression Part 1 - Introduction
Polynomial Regression | Machine Learning
Scikit Learn Linear Regression to refer to while coding
CS229 notes for mathematical explanations of the above topics